---
title: Content moderation system
description: Learn how to build a content moderation system that can analyze user-generated content and provide feedback on its appropriateness.
---

<Card
title="Live example"
href="https://app.latitude.so/share/d/9ed3ab72-5492-4cec-b490-71112bc608b9"
arrow="true"
cta="Copy to your Latitude">
You can play with this example in the Latitude Playground.
</Card>

## Overview

In this example, we will create a content moderation system that can analyze user-generated content and provide feedback on its appropriateness. The agent uses subagents to handle different aspects of content moderation efficiently.

## Multi-Agent Architecture

The system uses specialized subagents for different responsibilities:

- **main**: Coordinates the moderation process by dispatching content to all subagents, gathering their evaluations, and generating the final decision based on their collective input.
- **rule_checker**: Runs deterministic, rule-based checks—such as profanity filters or length validation—against the content, ensuring compliance with explicitly defined policies.
- **toxicity_analyzer**: Analyzes content for toxicity and subtle forms of harm like harassment, hate speech, or threats, taking context and intent into account, even in ambiguous or nuanced cases.
- **safety_scorer**: Calculates comprehensive risk and safety scores for the content, highlighting any areas of concern, escalation potential, or need for human review.

<Note>
All the tools used in the sub-agents have to be defined in the main prompt.
</Note>

## The prompts

<CodeGroup>
```markdown main
---
provider: google
model: gemini-1.5-flash
temperature: 0.2
type: agent
agents:
  - rule_checker
  - toxicity_evaluator
  - safety_scorer
tools:
  - check_profanity_filter:
      description: Detect explicit language and banned words in content
      parameters:
        type: object
        properties:
          content:
            type: string
            description: The content to check for profanity
          content_type:
            type: string
            description: Type of content (text, comment, post, etc.)
        required: [content]

  - validate_content_length:
      description: Ensure content meets platform length guidelines
      parameters:
        type: object
        properties:
          content:
            type: string
            description: The content to validate
          content_type:
            type: string
            description: Type of content to determine length limits
        required: [content, content_type]

  - scan_for_patterns:
      description: Identify suspicious patterns and spam indicators
      parameters:
        type: object
        properties:
          content:
            type: string
            description: The content to scan for patterns
          pattern_types:
            type: array
            items:
              type: string
            description: Types of patterns to look for (spam, repetitive, etc.)
        required: [content]
schema:
  type: object
  properties:
    decision:
      type: string
      enum: [approve, flag, reject]
      description: The final moderation decision
    confidence:
      type: number
      minimum: 0
      maximum: 1
      description: Confidence score for the decision
    reasoning:
      type: string
      description: Brief explanation for the decision
    violations:
      type: array
      items:
        type: string
      description: List of policy violations found
    recommended_action:
      type: string
      description: Specific action to take
  required: [decision, confidence, reasoning]
---

<system>
You are the main coordinator for an intelligent content moderation system. Your role is to orchestrate the moderation pipeline by delegating tasks to specialized agents and making final moderation decisions.

You have access to three specialized agents:
1. rule_checker - Applies programmatic rules and filters
2. toxicity_evaluator - Uses LLM-as-judge for nuanced content analysis
3. safety_scorer - Calculates safety metrics and risk scores

Process each content submission through all agents and synthesize their outputs into a final moderation decision.
</system>

<user>
Content to moderate: {{ content }}
Content type: {{ content_type }}
Platform context: {{ platform_context }}
</user>
```
```markdown rule_checker
---
provider: OpenAI
model: gpt-4o-mini
temperature: 0.1
type: agent
schema:
  type: object
  properties:
    rule_violations:
      type: array
      items:
        type: string
      description: List of violated rules
    severity:
      type: string
      enum: [low, medium, high]
      description: Overall severity level
    details:
      type: string
      description: Specific findings from rule checks
    passed_basic_filters:
      type: boolean
      description: Whether content passed basic filtering
  required: [rule_violations, severity, passed_basic_filters]
---

<system>
You are a rule-based content filter that applies programmatic rules to detect policy violations. You focus on deterministic, rule-based checks that can be applied consistently.

Use the provided tools to check content against various rules and filters. Be thorough but efficient in your rule application.
</system>

<user>
Content: {{ content }}
Content type: {{ content_type }}
</user>
```
```markdown safety_scorer
---
provider: anthropic
model: claude-3-5-sonnet-20241022
temperature: 0.1
type: agent
schema:
  type: object
  properties:
    safety_scores:
      type: object
      properties:
        immediate_harm_risk:
          type: integer
          minimum: 0
          maximum: 100
          description: Risk of immediate harm (higher = more risk)
        community_impact:
          type: integer
          minimum: 0
          maximum: 100
          description: Risk to community health (higher = more risk)
        policy_violation_severity:
          type: integer
          minimum: 0
          maximum: 100
          description: Severity of policy violations (higher = more severe)
        escalation_potential:
          type: integer
          minimum: 0
          maximum: 100
          description: Likelihood of escalation (higher = more likely)
        context_sensitivity:
          type: integer
          minimum: 0
          maximum: 100
          description: Context-specific risk (higher = more risk)
      required: [immediate_harm_risk, community_impact, policy_violation_severity, escalation_potential, context_sensitivity]
    overall_risk_score:
      type: integer
      minimum: 0
      maximum: 100
      description: Weighted average risk score
    confidence_interval:
      type: array
      items:
        type: integer
      minItems: 2
      maxItems: 2
      description: Lower and upper bounds of confidence interval
    requires_human_review:
      type: boolean
      description: Whether human review is recommended
    monitoring_level:
      type: string
      enum: [none, light, heavy]
      description: Suggested monitoring level
    risk_factors:
      type: array
      items:
        type: string
      description: Specific factors contributing to risk
  required: [safety_scores, overall_risk_score, requires_human_review, monitoring_level]
---

<system>
You are a safety metrics specialist that calculates comprehensive risk scores for content moderation decisions. You use negative evaluation scoring for safety-critical metrics, meaning higher scores indicate greater safety concerns.

Your role is to:
- Calculate risk scores across multiple safety dimensions
- Provide confidence intervals for moderation decisions
- Flag edge cases requiring human review
- Generate quantitative safety metrics

Use negative scoring where higher values indicate higher risk/safety concerns.
</system>

<user>
Content: {{ content }}
Rule checker results: {{ rule_results }}
Toxicity evaluation: {{ toxicity_results }}
</user>
```
```markdown toxicity_evaluator
---
provider: anthropic
model: claude-3-5-sonnet-20241022
temperature: 0.3
type: agent
schema:
  type: object
  properties:
    toxicity_detected:
      type: boolean
      description: Whether toxicity was detected
    toxicity_type:
      type: string
      enum: [harassment, hate_speech, threat, other, none]
      description: Type of toxicity found
    severity_score:
      type: integer
      minimum: 1
      maximum: 10
      description: Severity rating from 1-10
    confidence:
      type: number
      minimum: 0
      maximum: 1
      description: Confidence in the assessment
    reasoning:
      type: string
      description: Detailed explanation of the assessment
    context_factors:
      type: array
      items:
        type: string
      description: Factors that influenced the decision
  required: [toxicity_detected, severity_score, confidence, reasoning]
---

<system>
You are an expert content safety evaluator specializing in detecting toxicity, harassment, and harmful content. You excel at understanding context, nuance, and implicit harm that rule-based systems might miss.

Evaluate content for:
- Contextual toxicity (sarcasm, implicit harm)
- Cultural sensitivity issues
- Intent classification (harassment, hate speech, threats)
- Severity assessment on a graduated scale

Consider context, cultural nuances, and potential for harm. Be especially careful about edge cases and borderline content.
</system>

<user>
Content: {{ content }}
Platform context: {{ platform_context }}
User history: {{ user_history }}
</user>
```
</CodeGroup>

## Breakdown

Let's break down the example step by step to understand how it works.

<Steps>
  <Step title="Main Prompt">
    The main prompt acts as the central coordinator. It receives user-generated content, delegates the moderation tasks to the specialized subagents, aggregates their results, and produces a structured final decision with confidence and reasoning.
  </Step>
  <Step title="rule_checker">
    The rule_checker agent checks for clear, rule-based violations—like banned words, excessive length, or explicit policy breaches—using programmatic filters and deterministic logic.
  </Step>
  <Step title="toxicity_analyzer">
    The toxicity_analyzer (or toxicity_evaluator) uses advanced AI to evaluate whether the content contains toxicity, harassment, hate speech, or other forms of harmful language, considering nuance, context, and potential for implicit harm.
  </Step>
  <Step title="safety_scorer">
    The safety_scorer calculates various risk scores for the content, such as immediate harm, community impact, and escalation risk, and determines whether the situation requires human review or additional monitoring.
  </Step>
  <Step title="Final Decision">
    The main agent synthesizes all subagent outputs, weighing rule violations, toxicity, and risk scores to make a final moderation decision. This decision includes a confidence score, explanation, and recommended action for handling the content.
  </Step>
</Steps>

## Structured Output

Main prompt returns a [structured output](/guides/prompt-manager/json-output) because the moderation process must be machine-readable and reliable, allowing easy integration with other systems and clear auditing of every moderation decision.

## Code

In the code we prepared 4 cases of possible user input from different sources. In github you [have the code](https://github.com/latitude-dev/latitude-llm/blob/main/examples/package.json#L34) but the idea is to launch this code with different types of possible input to see how it works.

```bash
pnpm run ts:cases:content_moderation --type toxicity
```

The important part is that you can see the use of tools. The tools defined in the code are used to respond to the tools defined in the main prompt. These kind of tools are on your control and are things that usually don't need an LLM or AI to be responded like measure the length of the text of if the the text contains words that yout put in a blacklist.

<CodeGroup>
````typescript Typescript
import { Latitude } from '@latitude-data/sdk'

function printStatus(message: string) {
  process.stdout.write('\r' + message + ' '.repeat(30)) // Pad to overwrite old text
}

const MODEREATION_TYPES = {
  allGood: 'allGood',
  spam: 'spam',
  toxicity: 'toxicity',
  badWords: 'badWords',
} as const

type Content = {
  content: string
  content_type: string // post, comment, message, etc.
  platform_context: string // e.g., 'social_media', 'email', 'forum'
}
const CONTENT_BY_MODERATION_TYPE: Record<
  keyof typeof MODEREATION_TYPES,
  Content
> = {
  [MODEREATION_TYPES.allGood]: {
    content: 'The world is a beautiful place and I love it.',
    platform_context: 'Twitter',
    content_type: 'post',
  },
  [MODEREATION_TYPES.spam]: {
    content: 'Nigerian Prince wants to transfer money to you.',
    platform_context: 'email',
    content_type: 'Email message',
  },
  [MODEREATION_TYPES.toxicity]: {
    content: 'You are an idiot and nobody likes you.',
    platform_context: 'Reddit',
    content_type: 'comment',
  },
  [MODEREATION_TYPES.badWords]: {
    content: 'I hate Tomatoes because they could kill me',
    platform_context: 'Instagram',
    content_type: 'post',
  },
}

type Tools = {
  check_profanity_filter: { content: string; content_type: string }
  validate_content_lenght: { content: string; content_type: string }
  scan_for_patterns: {
    content: string
    content_type: 'spam' | 'phishing' | 'repititive'
  }
}

async function run({
  moderationType,
}: {
  moderationType: keyof typeof MODEREATION_TYPES
}) {
  const sdk = new Latitude(process.env.LATITUDE_API_KEY, {
    projectId: Number(process.env.PROJECT_ID),
    versionUuid: 'live',
  })

  try {
    const result = await sdk.prompts.run<Tools>(
      'content-moderation-system/main',
      {
        parameters: CONTENT_BY_MODERATION_TYPE[moderationType],
        stream: true,
        onEvent: (event) => {
          printStatus(`Generating response... ${event.data.type}`)
        },
        tools: {
          check_profanity_filter: async ({ content }) => {
            if (content.includes('Tomatoes')) {
              return {
                content_type: 'badWords',
                description: 'Content contains prohibited words.',
              }
            }

            return {
              content_type: 'ok',
              description:
                'Content is clean and does not contain prohibited words.',
            }
          },
          validate_content_lenght: async ({ content: _c }) => {
            return 'ok' // Assuming content length is valid for this example
          },
          scan_for_patterns: async ({ content }) => {
            if (moderationType === 'spam') {
              if (content.includes('Nigerian Prince')) {
                return {
                  content_type: 'spam',
                  description:
                    'This content appears to be spam, possibly a scam involving a Nigerian Prince.',
                }
              }
            }

            return {
              content_type: 'ok',
              description:
                'Content is clean and does not match any known patterns.',
            }
          },
        },
      },
    )

    const response = result.response
    console.log('Agent Response: \n', JSON.stringify(response, null, 2))
  } catch (error) {
    console.error('Error: ', error.message, '\nStack:', error.stack)
  }
}

const [, , ...args] = process.argv

const moderationType = MODEREATION_TYPES[args[1]]

if (!moderationType) {
  console.error('Invalid moderation type. Please use one of the following: \n')
  Object.keys(MODEREATION_TYPES).forEach((type) => {
    console.error(`pnpm run ts:cases:content_moderation --type ${type} \n`)
  })
  process.exit(1)
}

run({ moderationType })
````
</CodeGroup>

## Resources

- [Custom Tools](/guides/prompt-manager/tools) - How to integrate with customer databases and CRM systems
- [Tool call SDK example](/examples/sdk/run-prompt-with-tools) - A simple example of how to run a prompt with tools with Latitude SDK.
- [JSON Schema Output](/guides/prompt-manager/json-output) - Ensuring consistent response formatting
